Tutorial 2: data integration for mouse thymus Stereo-CITE-seq

In this tutorial, we demonstrate how to apply SpatialGlue to integrate Stereo-CITE-seq (Liao et al.) data to obtain fine-grained clusters. Stereo-CITE-seq co-detects mRNAs and proteins in immune organs. As a example, we analyse a mouse thymus dataset. We collected the data from BGI. According to marker genes and proteins, we manually annotated the tissue to 8 cell types as shown below.

Before running the model, please download the input data via https://drive.google.com/drive/folders/17hDLGDENIds1_I8WlQ_c8tcxkjLIuVFV.

Loading package

[170]:

import os
import torch
import pandas as pd
import scanpy as sc

[171]:

import SpatialGlue

[172]:

# Environment configuration. SpatialGlue pacakge can be implemented with either CPU or GPU. GPU acceleration is highly recommend for imporoved efficiency.
device = torch.device('cuda:2' if torch.cuda.is_available() else 'cpu')

# the location of R, which is required for the 'mclust' clustering algorithm. Please replace the path below with local R installation path
os.environ['R_HOME'] = '/scbio4/tools/R/R-4.0.3_openblas/R-4.0.3'

Loading data

[173]:

# read data
file_fold = '/home/yahui/anaconda3/work/SpatialGlue_revision/data/Dataset3_Mouse_Thymus1/' #please replace 'file_fold' with the download path

adata_omics1 = sc.read_h5ad(file_fold + 'adata_RNA.h5ad')
adata_omics2 = sc.read_h5ad(file_fold + 'adata_protein.h5ad')

adata_omics1.var_names_make_unique()
adata_omics2.var_names_make_unique()

[174]:

# Specify data type
data_type = 'Stereo-CITE-seq'

# Fix random seed
from SpatialGlue.preprocess import fix_seed
random_seed = 2022
fix_seed(random_seed)

Pre-processing data

SpatialGlue adopts standard pre-processing steps for the transcriptomic and protein data. Specifically,for the transcriptomics data,the gene expression counts are log-transformed and normalized by library size via the SCANPY package. The top 3,000 highly variable genes (HVGs) are selected as input of PCA for dimension reduction. To ensure a consistent input dimension with the protein data, the first k (number of proteins) principal components are retained and used as the input of the model. The protein expresssion counts are normliazed using CLR (Centered Log Ratio). All principal components after PCA dimension reduction are used as the input of the model.

[175]:

from SpatialGlue.preprocess import clr_normalize_each_cell, pca

# RNA
sc.pp.filter_genes(adata_omics1, min_cells=10)
sc.pp.filter_cells(adata_omics1, min_genes=80)

sc.pp.filter_genes(adata_omics2, min_cells=50)
adata_omics2 = adata_omics2[adata_omics1.obs_names].copy()

sc.pp.highly_variable_genes(adata_omics1, flavor="seurat_v3", n_top_genes=3000)
sc.pp.normalize_total(adata_omics1, target_sum=1e4)
sc.pp.log1p(adata_omics1)

adata_omics1_high =  adata_omics1[:, adata_omics1.var['highly_variable']]
adata_omics1.obsm['feat'] = pca(adata_omics1_high, n_comps=adata_omics2.n_vars-1)

# Protein
adata_omics2 = clr_normalize_each_cell(adata_omics2)
adata_omics2.obsm['feat'] = pca(adata_omics2, n_comps=adata_omics2.n_vars-1)

Constructing neighbor graph

To preserve the physical proximity between spots and phenotypic proximity of spots which have the same cell types but are spatially non-adjacent to each other, we construct two neighbors graphs, i.e., spatial graph and feature graph. Spatial graph is constructed based on spatial information while feature graph is constructed based on expression profiles.

[176]:

from SpatialGlue.preprocess import construct_neighbor_graph
data = construct_neighbor_graph(adata_omics1, adata_omics2, datatype=data_type)

Training the model

The SpatialGlue model aims to learn an integrated latent representation by adaptively integrating expression profiles of different omics modalities in a spatially aware manner.

After model training, SpatialGlue returns ‘output’ file. The ‘output’ file include multiple output results. Let’s go through each of the results in more detail:

Latent Representations:

‘emb_latent_omics1’: latent representation for the first omics modality.
‘emb_latent_omics2’: latent representation for the second omics modality.
‘SpatialGlue’: joint representation learned by incorporating expression data and spatial location information.

The joint representation can be used for downstream analysis such as clustering, visualization, or identifying differentially expressed genes (DEGs).

Attention Weight Values:

‘alpha_omics1’: intra-modality attention weight for the first omics modality, explaining the contribution of each graph to each cluster.
‘alpha_omics2’: intra-modality attention weight for the second omics modality, explaining the contribution of each graph to each cluster.
‘alpha’: inter-modality attention weight explaining the contribution of each modality to each cluster.

These intra- and inter-modality attention weights provide interpretable insights into the importance of each neighborhood graph and modality to each cluster.

[177]:

# define model
from SpatialGlue.SpatialGlue_pyG import Train_SpatialGlue
model = Train_SpatialGlue(data, datatype=data_type, device=device)

# train model
output = model.train()

  0%|                                                                                                                                                                                | 0/1500 [00:00<?, ?it/s]/home/yahui/anaconda3/envs/STGAT/lib/python3.8/site-packages/torch/nn/functional.py:1933: UserWarning: nn.functional.tanh is deprecated. Use torch.tanh instead.
  warnings.warn("nn.functional.tanh is deprecated. Use torch.tanh instead.")
/home/yahui/anaconda3/envs/STGAT/lib/python3.8/site-packages/SpatialGlue/model.py:212: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
  self.alpha = F.softmax(torch.squeeze(self.vu) + 1e-6)
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1500/1500 [00:39<00:00, 37.80it/s]

Model training finished!

[178]:

adata = adata_omics1.copy()
adata.obsm['emb_latent_omics1'] = output['emb_latent_omics1']
adata.obsm['emb_latent_omics2'] = output['emb_latent_omics2']
adata.obsm['SpatialGlue'] = output['SpatialGlue']
adata.obsm['alpha'] = output['alpha']
adata.obsm['alpha_omics1'] = output['alpha_omics1']
adata.obsm['alpha_omics2'] = output['alpha_omics2']

Cross-omics integrative analysis

After integration, we perform clustering analysis using the joint representation. Here we provid three optional kinds of tools for clustering, including mclust, leiden, and louvain. In our experiment, we find ‘mclust’ algorithm performs better than ‘leiden’ and ‘louvain’ on spatial data in most cases. Therefore, we recommend using ‘mclust’ algorithm for clustering.

[179]:

# we set 'mclust' as clustering tool by default. Users can also select 'leiden' and 'louvain'
from SpatialGlue.utils import clustering
tool = 'mclust' # mclust, leiden, and louvain
clustering(adata, key='SpatialGlue', add_key='SpatialGlue', n_clusters=8, method=tool, use_pca=True)

fitting ...
  |======================================================================| 100%

[180]:

# visualization
import matplotlib.pyplot as plt
adata.obsm['spatial'][:,1] = -1*adata.obsm['spatial'][:,1]

fig, ax_list = plt.subplots(1, 2, figsize=(7, 3))
sc.pp.neighbors(adata, use_rep='SpatialGlue', n_neighbors=30)
sc.tl.umap(adata)

sc.pl.umap(adata, color='SpatialGlue', ax=ax_list[0], title='SpatialGlue', s=20, show=False)
sc.pl.embedding(adata, basis='spatial', color='SpatialGlue', ax=ax_list[1], title='SpatialGlue', s=20, show=False)

plt.tight_layout(w_pad=0.3)
plt.show()

/home/yahui/anaconda3/envs/STGAT/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
/home/yahui/anaconda3/envs/STGAT/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(

_images/Tutorial_2_data_integration_for_mouse_thymus_Stereo-CITE-seq_21_1.png

[181]:

# annotation
adata.obs['SpatialGlue_number'] = adata.obs['SpatialGlue'].copy()
adata.obs['SpatialGlue'].cat.rename_categories({1: '5-Outer cortex region 3(DN T,DP T,cTEC)',
                                                2: '7-Subcapsular zone(DN T)',
                                                3: '4-Middle cortex region 2(DN T,DP T,cTEC)',
                                                4: '2-Corticomedullary Junction(CMJ)',
                                                5: '1-Medulla(SP T,mTEC,DC)',
                                                6: '6-Connective tissue capsule(fibroblast)',
                                                7: '8-Connective tissue capsule(fibroblast,RBC,myeloid)',
                                                8: '3-Inner cortex region 1(DN T,DP T,cTEC)'
                                                }, inplace=True)

/tmp/ipykernel_8958/1140237054.py:3: FutureWarning: The `inplace` parameter in pandas.Categorical.rename_categories is deprecated and will be removed in a future version. Removing unused categories will always return a new Categorical object.
  adata.obs['SpatialGlue'].cat.rename_categories({1: '5-Outer cortex region 3(DN T,DP T,cTEC)',

[182]:

list_ = ['3-Inner cortex region 1(DN T,DP T,cTEC)','2-Corticomedullary Junction(CMJ)','4-Middle cortex region 2(DN T,DP T,cTEC)',
         '7-Subcapsular zone(DN T)', '5-Outer cortex region 3(DN T,DP T,cTEC)', '8-Connective tissue capsule(fibroblast,RBC,myeloid)',
         '1-Medulla(SP T,mTEC,DC)','6-Connective tissue capsule(fibroblast)']
adata.obs['SpatialGlue']  = pd.Categorical(adata.obs['SpatialGlue'],
                      categories=list_,
                      ordered=True)

[183]:

# plotting with annotation
fig, ax_list = plt.subplots(1, 2, figsize=(9.5, 3))
sc.pp.neighbors(adata, use_rep='SpatialGlue', n_neighbors=30)
sc.tl.umap(adata)

sc.pl.umap(adata, color='SpatialGlue', ax=ax_list[0], title='SpatialGlue', s=10, show=False)
sc.pl.embedding(adata, basis='spatial', color='SpatialGlue', ax=ax_list[1], title='SpatialGlue', s=20, show=False)

ax_list[0].get_legend().remove()

plt.tight_layout(w_pad=0.3)
plt.show()

/home/yahui/anaconda3/envs/STGAT/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
/home/yahui/anaconda3/envs/STGAT/lib/python3.8/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(

_images/Tutorial_2_data_integration_for_mouse_thymus_Stereo-CITE-seq_24_1.png

[184]:

# Exchange attention weights corresponding to annotations
list_SpatialGlue = [5,4,8,3,1,6,2,7]
adata.obs['SpatialGlue_number']  = pd.Categorical(adata.obs['SpatialGlue_number'],
                      categories=list_SpatialGlue,
                      ordered=True)
adata.obs['SpatialGlue_number'].cat.rename_categories({5:1,
                                                       4:2,
                                                       8:3,
                                                       3:4,
                                                       1:5,
                                                       6:6,
                                                       2:7,
                                                       7:8
                                                }, inplace=True)

/tmp/ipykernel_8958/4089864425.py:6: FutureWarning: The `inplace` parameter in pandas.Categorical.rename_categories is deprecated and will be removed in a future version. Removing unused categories will always return a new Categorical object.
  adata.obs['SpatialGlue_number'].cat.rename_categories({5:1,

[185]:

# plotting modality weight values.
import pandas as pd
import seaborn as sns
plt.rcParams['figure.figsize'] = (5,3)
df = pd.DataFrame(columns=['RNA', 'protein', 'label'])
df['RNA'], df['protein'] = adata.obsm['alpha'][:, 0], adata.obsm['alpha'][:, 1]
df['label'] = adata.obs['SpatialGlue_number'].values
df = df.set_index('label').stack().reset_index()
df.columns = ['label_SpatialGlue', 'Modality', 'Weight value']
ax = sns.violinplot(data=df, x='label_SpatialGlue', y='Weight value', hue="Modality",
                split=True, inner="quart", linewidth=1, show=False)
ax.set_title('RNA vs protein')
ax.set_xlabel('SpatialGlue label')
ax.legend(bbox_to_anchor=(1.4, 1.01), loc='upper right')

plt.tight_layout(w_pad=0.05)
#plt.show()

_images/Tutorial_2_data_integration_for_mouse_thymus_Stereo-CITE-seq_26_0.png